First, we need to import all the libraries needed for the analysis and load the data file:
In [1]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
In [2]:
# Read the csv file
titanic = pd.read_csv("titanic-data.csv")
The next step is to explore the dataset:
In [3]:
titanic.shape
Out[3]:
In [4]:
titanic.columns
Out[4]:
In [5]:
titanic
Out[5]:
We can see that PassengerId, Name, Ticket, Cabin and Embarked have little value to the analysis, so we drop these columns from the dataset:
In [6]:
titanic = titanic.drop(['PassengerId','Name','Ticket', 'Cabin', 'Embarked'], axis=1)
In [7]:
titanic['Survived'].describe()
Out[7]:
We can see that the Age column has a lot of NAs. We need to fill in the blanks with random values generated within one standard deviation of the mean age.
In [8]:
titanic['Age'].describe()
Out[8]:
In [9]:
average_age = titanic["Age"].mean()
std_age = titanic["Age"].std()
count_nan_age = titanic["Age"].isnull().sum()
# generate random integers in the range [mean - std, mean + std)
rand = np.random.randint(int(average_age - std_age), int(average_age + std_age), size = count_nan_age)
In [10]:
# Fill NAs in Age with the random values generated above
# (.loc avoids chained assignment, which may write to a copy)
titanic.loc[titanic['Age'].isnull(), 'Age'] = rand
In [11]:
titanic['Age'].describe()
Out[11]:
In [12]:
sns.distplot(titanic['Age'])
plt.show()
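The random fill used for Age above can be sketched on a small example. This is a minimal sketch, not the notebook's exact cell: the helper name `fill_age_randomly`, its `seed` parameter and the toy ages are all made up for illustration, and it uses NumPy's `default_rng` so that the result is reproducible.

```python
import numpy as np
import pandas as pd

def fill_age_randomly(ages, seed=0):
    """Return a copy of `ages` with NaNs replaced by random integers
    drawn from [mean - std, mean + std). Hypothetical helper."""
    rng = np.random.default_rng(seed)
    low = int(ages.mean() - ages.std())
    high = int(ages.mean() + ages.std())
    missing = ages.isnull()
    filled = ages.copy()
    # One random draw per missing value; known ages are left untouched
    filled.loc[missing] = rng.integers(low, high, size=int(missing.sum()))
    return filled

# Toy data (made up for illustration, not taken from the Titanic file)
toy_ages = pd.Series([22.0, np.nan, 38.0, 26.0, np.nan, 35.0, 54.0, 2.0])
filled_ages = fill_age_randomly(toy_ages)
print(filled_ages)
```

Drawing a different value for each missing entry keeps some spread in the column, whereas filling every blank with a single constant would pile all the missing passengers onto one age.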
A passenger's family size equals their number of siblings/spouses (SibSp) and parents/children (Parch) on the ship, plus themselves:
In [13]:
# Family size
titanic['Family_size'] = titanic['SibSp'] + titanic['Parch'] + 1
Next, we extract the survivors into a separate dataset for later analysis:
In [14]:
survived = titanic[titanic['Survived'] == 1]
According to Wikipedia, "Women and children first" is a code of conduct dating from 1860, whereby the lives of women and children were to be saved first in a life-threatening situation, typically abandoning ship, when survival resources such as lifeboats were limited. The wiki page already gives some insights and statistics on the survival rate of the Titanic; in this analysis, however, I will reconfirm them and attempt to find out which other factors determined the survival rate in the Titanic tragedy.
The questions I am going to answer in this analysis are:
Assuming people were neutral on the gender of a child, I split the passengers into 3 types: child (aged 16 or under), female adult and male adult:
In [15]:
def passenger_type(person):
    if person['Age'] <= 16:
        return "child"
    elif person['Sex'] == "female":
        return "female_adult"
    else:
        return "male_adult"
titanic['Type'] = titanic.apply(passenger_type, axis = 1)
titanic
Out[15]:
In [16]:
titanic['Type'].value_counts()
Out[16]:
In [17]:
sns.set(style="darkgrid")
ax = sns.countplot(x="Type", data = titanic)
plt.show()
We can see that male adults were the largest group of people on the ship, followed by female adults and children.
Now looking into the survival rate:
In [18]:
survived = titanic[titanic['Survived'] == 1]
non_survived = titanic[titanic['Survived'] == 0]
In [19]:
survived['Type'].value_counts()
Out[19]:
In [20]:
non_survived['Type'].value_counts()
Out[20]:
In [21]:
sns.set(style="darkgrid")
ax = sns.countplot(x="Survived", hue = "Type", data = titanic)
plt.show()
Compared with the initial number of people of each type, children had a survival rate of more than 50%, female adults had an impressive survival rate of around 75%, while male adults had a low survival rate of around 16%. This suggests that a "women and children first" code was followed when it came to saving people on the ship.
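The rates quoted above come from comparing the survivor counts with the initial counts. Because Survived is coded as 0 or 1, the same rates can also be computed directly as a group mean. The following is a minimal sketch on toy data (the rows are made up for illustration and are not from the Titanic file); the same `groupby` call should work on the `titanic` frame, assuming the `Type` column has been created.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Type": ["child", "child", "female_adult", "female_adult",
             "male_adult", "male_adult", "male_adult", "male_adult"],
    "Survived": [1, 0, 1, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate
rate_by_type = toy.groupby("Type")["Survived"].mean()
print(rate_by_type)
```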
In [22]:
sns.distplot(survived['Age'])
plt.show()
The histogram of the age distribution of the survivor group also suggests that younger people had an advantage in survival compared with older passengers.
We can assume that someone's class on the Titanic represented their socio-economic status. We also assume that fares correlate directly with class, so we only need to examine one of them.
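The assumption that fares follow class is not checked in this analysis. One way it could be checked is sketched below on toy data (the fares are made up for illustration): compare the median fare per class, and compute a rank correlation by correlating the ranks of the two columns. On the real data this would assume the `Fare` and `Pclass` columns are still present in `titanic`.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 26.0, 13.0, 8.05, 7.25],
})

# Median fare per class: expected to fall as the class number rises
fare_by_class = toy.groupby("Pclass")["Fare"].median()

# Spearman-style rank correlation, computed as the Pearson
# correlation of the ranks of the two columns
rank_corr = toy["Pclass"].rank().corr(toy["Fare"].rank())

print(fare_by_class)
print(rank_corr)
```

A strongly negative rank correlation would support examining only one of the two variables.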
In [23]:
titanic['Pclass'].value_counts()
Out[23]:
In [24]:
sns.set(style="darkgrid")
ax = sns.countplot(x = "Pclass", data = titanic)
plt.show()
Approximately 55% of the passengers belonged to the third class, while the rest belonged to the first and second classes. Now we'll see whether the first and second class passengers also paid a premium when it came to safety.
In [25]:
survived['Pclass'].value_counts()
Out[25]:
In [26]:
sns.set(style="darkgrid")
ax = sns.countplot(x = "Pclass", hue = "Survived", data = titanic)
plt.show()
The survival rate of the first class passengers was more than 60%, while the survival rate of the third class passengers was only around 25%. So we can see that there was a bias towards wealth and socio-economic status, even in a life-threatening situation.
Now, if we factor in both passenger class and type (male adult, female adult or child), which one carries more weight in the survival rate?
In [27]:
titanic.groupby(['Pclass', 'Type']).Type.count()
Out[27]:
In [28]:
sns.set(style="darkgrid")
ax = sns.countplot(x = "Pclass", hue = "Type", data = titanic)
plt.show()
In [29]:
titanic.groupby(['Pclass', 'Type']).agg({'Survived': 'sum'})
Out[29]:
In [30]:
sns.set(style="darkgrid")
ax = sns.countplot(x = "Pclass", hue = "Type", data = survived)
plt.show()
We can see that the women and children of the first class had a remarkably high survival rate (more than 90% and 80% respectively), whereas the women and children of the third class had a much lower survival rate (more than 45% and around 40% respectively). However, the women and children from the third class did have a higher survival rate than the men from higher classes. Men from the first class had a survival rate of around 35%, which was actually below the overall survival rate of 38.38%. Men from the second and third classes suffered very low survival rates, around 8% and around 12% respectively, compared with their initial numbers.
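The rates above are read off two separate count tables (initial numbers and survivors). A pivot table can put the rate for each class and type combination into a single grid. This is a minimal sketch on toy data (made up for illustration); on the real data it would assume the `Pclass`, `Type` and `Survived` columns of `titanic`.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "Type": ["female_adult", "female_adult", "male_adult", "male_adult",
             "female_adult", "female_adult", "male_adult", "male_adult"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows are classes, columns are passenger types, and each cell is
# the mean of the 0/1 Survived column, i.e. the survival rate
rate_table = toy.pivot_table(index="Pclass", columns="Type",
                             values="Survived", aggfunc="mean")
print(rate_table)
```

Reading across a row compares types within one class, and reading down a column compares classes for one type.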
Did people have a higher chance of survival if they traveled with family rather than traveling alone? We'll find out.
In [31]:
titanic['Family_size'].value_counts()
Out[31]:
We can see that the majority of the passengers traveled alone, followed by families of 2 or 3. Families with more than 3 members made up a small part of the ship. Now we look into the survival statistics:
In [32]:
survived['Family_size'].value_counts()
Out[32]:
In [33]:
sns.boxplot(x="Survived", y="Family_size", data=titanic)
plt.show()
In [34]:
sns.kdeplot(survived['Family_size'], shade=True)
plt.show()
Both the boxplot and the distribution curve show that small families (under 4 members) made up around 75% of the survivors. Big families seem to have been penalized harshly on survival rate.
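The share of survivors that came from a given family size and the survival rate within that family size are two different quantities. The sketch below computes both on toy data (made up for illustration); on the real data it would assume the `Family_size` and `Survived` columns of `titanic`.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Family_size": [1, 1, 1, 1, 2, 2, 3, 3, 6, 6],
    "Survived":    [0, 0, 1, 0, 1, 1, 1, 0, 0, 0],
})

# Survival rate within each family size (mean of the 0/1 column)
rate_by_size = toy.groupby("Family_size")["Survived"].mean()

# Share of all survivors that belongs to each family size
survivors = toy.loc[toy["Survived"] == 1, "Family_size"]
share_of_survivors = survivors.value_counts(normalize=True)

print(rate_by_size)
print(share_of_survivors)
```

A family size can account for a large share of the survivors simply because many passengers fall into it, so the per-size rate is the more direct measure of whether big families were penalized.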
In this analysis, we can see that there was a clear trend of "Women and children first" when it came to helping and rescuing people from the Titanic. The data also suggests an impact of socio-economic class and family size on someone's chance of survival, although they didn't have as much impact as the "women and children first" rule: women and children from the third class still had a better chance of survival than men from higher classes.
The analysis has a few limitations. A lot of values were missing in the Age column, and filling them with randomized numbers introduces a margin of error into the analysis. With more knowledge of how to make use of variables such as Name, Embarked or Cabin, the analysis could also be improved.